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Abstract. We present recent results on two attempts at understanding and 
utilizing large-scale simulations of magnetic materials. In the first study we 
consider massively parallel implementations on a Cray T3E of the n-fold way 
algorithm for magnetization switching in kinetic Ising models. We find an in- 
tricate relationship between the average time increment and the size of the spin 
blocks on each processor. This narrows the regime of efficient implementation. 
The second study concerns incorporating noise into micromagnetic calculations 
using Langevin methods. This allows measurement of quantities such as the 
probability that the system has not switched within a given time. Preliminary 
results are reported for arrays of single-domain nanoscale pillars. 

1. Introduction 

To model realistic magnetic systems of interest in, for example, the magnetic 
recording industry requires that a number of difficult simulational problems be 
addressed. Preliminary results on two such problems are presented here. 

Simulating metastable decay involves long characteristic time scales (the 
metastable lifetime), and several sophisticated algorithms have been developed 
for serial computers [1,2,3]. A common testbed for these algorithms is the 
kinetic Ising ferromagnet below its critical temperature, T c , which exhibits 
slow metastable decay after the reversal of the external magnetic field [4] . This 
model is appropriate for the study of highly anisotropic single-domain nanopar- 
ticles and thin films [5] . Even with these sophisticated algorithms the computer 
power required is enormous. Efficient computation requires these algorithms to 
be scalable and effectively implemented on massively parallel computers. We 
present an experiment on the parallelization of n-fold algorithms. 

For less anisotropic magnetic materials, continuous-spin models should be 
simulated. To model metastable decay at finite temperature and measure time 
dependent quantities of experimental interest requires extensions of normal 
micromagnetic calculations. We report our first Langevin micromagnetic cal- 
culation, designed to model arrays of pillars grown with an STM technique 
[6]. 

2. Parallelization of the n-fold way algorithm 

We present and analyze a variation [7] of the n-fold way algorithm [1,2] for 
magnetization switching in the kinetic Ising model on a distributed-memory 



parallel computer. The implementation of efficient massively parallel algo- 
rithms for Monte Carlo simulations is an interesting and challenging problem, 
which is one of the most complex ones in parallel computing. It belongs to the 
class of parallel discrete-event simulation (sometimes referred to as distributed 
simulation) which has numerous applications in engineering, computer science, 
and economics, as well as in physics [8]. These dynamics, which obviously 
contain a substantial amount of parallelism, were traditionally simulated on 
serial computers. Paradoxically, it is difficult to implement an efficient parallel 
algorithm to simulate them, mainly due to the fact that the discrete events are 
not synchronized by a global clock. 

The kinetic Ising model, either with the standard integer-time updates 
or with Glauber's continuous-time interpretation, was believed to be inher- 
ently serial. Contrary to that belief, Lubachevsky presented a method for 
parallel simulation of these systems [7] without changing the underlying dy- 
namics. Also, he proposed a way to incorporate the n-fold way algorithm, 
possibly giving further speedup. We implement his algorithm on the isotropic, 
square-lattice Ising model with periodic boundary conditions and Hamiltonian 
W=— J^/ij) s i s j — H^2 t Si. Here J>0 is the ferromagnetic nearest-neighbor 
spin-spin interaction and H is the external field. To study metastable de- 
cay, all spins are initialized in the +1 state, and we apply a negative mag- 
netic field at constant T < T c . In the corresponding serial algorithm we use 
the single-spin-flip Metropolis rates, where the probability to flip a spin is 
p=min{l, exp(— AH)}. In the rejection-free n-fold way update scheme, a flip is 
always performed, and the time is incremented appropriately. One must then 
introduce the notion of spin classes which carry the state of the spin itself and 
its neighbors. In the above model there are ten classes, characterized by the 
number of spins in class i, rii, and the flipping probability of a spin in class 
i, Pi. Since the classes are disjoint, 5Z i= i n i = L 2 , where L is the linear size 
of the lattice. To perform an update, first a class is chosen according to the 
relative weights {njPj}J£ 1; then one of the spins in the class is picked with 
equal probability, 1/nj. Once the class information, in particular the rij's, have 
been updated, the time of the next update is determined. The time increment 
is a random quantity, given by — \n(r)L 2 / riiPi, where r is a uniformly 
distributed random number in (0,1]. For integer-time updates, the only dif- 
ference is that one must draw the time increments from a discrete geometrical 
distribution instead of the continuous exponential one [2] . 

To parallelize the above algorithm, the LxL lattice is spatially decomposed 
into (L/l) 2 blocks of size Ixl. On a parallel computer, each processing element 
(PE) carries an Ixl block of spins and the number of PEs is Npe~(L/1) 2 . How- 
ever, one cannot simply run a copy of the serial n-fold way algorithm on each PE 
without the possibility of corrupting the history of neighboring PEs. On each 
PE an additional class is defined which contains the spins on its boundary. The 
relative weight of this class is the number of spins on the boundary, Nt,=4(l — 1) , 
which clearly does not change during the simulation. The original tabulation 



of spins is only used in the kernel of the block. Hence, A^f, + n i—^i where 
N=l 2 is the total number of spins in a block. The update scheme differs from 
the original (continuous-time) algorithm in the following steps: (i) if the spin 
chosen belongs to the boundary, then the updating PE must wait until its lo- 
cal time becomes less or equal than that of its neighboring PEs (at most two 
in two dimensions). Then the state of this spin may or may not change: its 
flipping probability is determined by the usual Metropolis rates, (ii) once an 
update is completed, the time of the next update is determined by the local 
time increment, 




number of processing elements number of processing elements 

Fig. 1 Speedup measurements for the parallel code as a function of the number 
of processing elements, iVpE- (a) For fixed system size, L=512, the block size, 
I, decreases with increasing Np&. (b) For fixed block size, l=Q4 and 128, the 
system size L is increasing with increasing Npe- 

It is clear from the above algorithm that at any given (wall clock) mo- 
ment different PEs generally have different local simulated times. The "wait 
until" control structure in (i), however, ensures that the information passed be- 
tween neighboring PEs does not violate causality [7] . The above asynchronous 
algorithm is suitable for a continuous-time update scheme, but it can cause 
inconsistency when integer time is used. Then, to ensure the reproducibility 
of a simulated path, provided the same set of random seeds are used, explicit 
barrier synchronization should be incorporated (synchronous algorithm). 

We implement the above versions of the n-fold way algorithm on the Cray 
T3E parallel computer at NERSC, using the Cray-specific, logically shared, 
distributed memory access (SHMEM) routines for message passing. The fast 
SHMEM library supports communication initiated by one PE, together with 
remote atomic memory operations. Without these features, it would be ex- 
tremely inconvenient to code an algorithm for stochastic simulation on a dis- 
tributed memory machine, where the communication pattern is completely un- 



predictable. These characteristics outweigh the loss of portability of our code. 
Details on the implementation will be published elsewhere [9] . 

We note some inherently weak features, which are not related to the fast 
communication hardware of the parallel architecture. First, as a general guide- 
line, the fewer communications needed, the better the performance of the par- 
allel code. In our case, the probability to pick a spin on the boundary, which 
is ultimately followed by some communication, is greater than the surface-to- 
volume ratio. It is determined by the relative weights in the modified n-fold way 
algorithm, Nb/(Nb + J2i=i n iVi)- With very small p^s this ratio can become 
close to 1, leading to more frequent message passing and idling as required by 
the "wait until" condition. Second, the average time increment in Eq. (1) is 
not bounded by p~*„ as in the serial n-fold way algorithm, but by 

1 I 2 

{At)max = N b /N + p min N k /N < 40 - !) ' (2) 

where Nk is the number of spins in the kernel. Hence, however small the flipping 
probabilities, the average local time increment is limited by approximately 1/4. 
Consequently, reasonable performance requires lp m i n >^\. 

We tested the scaling of the code (both asynchronous and synchronous 
versions) up to 256 PEs at T=0.7T C and |i2"|/T=0.18, in two different ways. 
First, the system size is kept constant (L=512), and we divide it into smaller 
and smaller blocks (Fig. la). Second, we keep the block size fixed (Z=64, 128), 
and study larger systems by increasing the number of blocks (Fig. lb). We 
determine the efficiency and speedup by comparing with the serial n-fold way 
performed on one T3E node. The results reflect the features discussed in the 
previous paragraph. In the first case we observe poor scaling, due to drastically 
decreasing average time increments and a slightly decreasing utilization ratio. 
In the second case, the average time increments are not affected while the uti- 
lization saturates, leading to reasonably good scaling. The larger the block size, 
the better the performance. For the continuous time, asynchronous algorithm, 
with Z=128 and using 256 PEs, the speedup is 100. It can be systematically 
improved by taking larger I values. However, the memory of a PE is not un- 
limited: the largest cell size that we could allocate in a T3E node was /=1400. 
The asynchronous algorithm suits this distributed-memory architecture best. 

The practical applicability of our implementation is obviously driven to 
large systems. The narrow regime of efficient implementation is due to the 
introduction of a special class in the n-fold way algorithm which "shields" the 
blocks from each other. The algorithm avoids rollbacks, but pays a large price: 
it looses the most important feature of the serial n-fold way algorithm — the 
arbitrarily large time increments at arbitrarily low temperature and field. The 
only way to preserve the advantage of the original n-fold way algorithm is to 
apply it directly on each block, together with a complex rollback procedure 
[10]. Work is in progress to incorporate it in our simulations of metastable 
decay. 



Fig. 2 A single snap-shot of a Langevin micromagnetic calculation for 
magnetization reversal in a square array of Ni pillars that are 200 nm apart, 
200 nm tall, and have a diameter of 40 nm. Each pillar is discretized using 
5 lattice points. (The vertical scale of this figure is enhanced compared with 
the horizontal scale for clarity of presentation.) The temperature is 300 K, 
the spins are initially up, and the applied field is down with a magnitude of 
1225 Oe. This is at a time of 14 nsec following the reversal of the field. The 
integration time step is At=l psec. Note that this looks very different from a 
coherent rotation mode of spin reversal. 

3. Langevin Micromagnetics 

In order to simulate systems for which the Ising model is not a faithful rep- 
resentation, we have programmed a Langevin micromagnetics code similar to 
that reported in [11]. With a phenomcnological damping parameter a, and 
classical spins of constant length given by the bulk saturation magnetization 
M s , at each lattice site i we have a scaled magnetization m = M s /M s . The 
standard Ginzburg-Landau-Lifshitz micromagnetic equation [12,13] is 

drfii 1 _ \ 

— 7T = ~— a m « X "*.eff + am i X KeS ■ ( 3 ) 

at 1 + cr \ J 

The scaled field at each site, hi^s, contains contributions from terms including 
the dipolc-dipole interaction, the exchange interaction, the interaction due to 
crystaline anisotropy, the applied field, and a scaled noise term proportional to 
the the Langevin fields ((t) [12,13]. In our case the Langevin noise term ( and 
the integration time step At are related by CcxV At, which we think is more 
physical than the £ocl/\/At of [11]. Even though the set of equations used 
in this Langevin micromagnets simulation are approximations to the actual 
equations [14], the approximation should be reasonable well below the critical 
temperature. In order to keep the length of the m; constant, we have used 



a fourth-order Rungc-Kutta algorithm. Fig. 2 shows the type of simulations 
[15] that can be performed for arrays of magnetic pillars that can be built and 
measured experimentally [6] . The importance of the finite temperature thermal 
fluctuations and the rotation mode, which is very different from that of uniform 
rotation, can be seen in this figure. Note that with standard time-independent 
micromagnetic calculations there would be no magnetization reversal, since the 
strength of the applied field is smaller than that of the nuclcation field. 
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